Slide Deck


Advancing Academic Chatbots: Evaluation of Non-Traditional Outputs

Favero, Nicole, Salute, Francesca, Hardt, Daniel

arXiv.org Artificial Intelligence

Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies, Graph RAG (structured, knowledge-graph-based) and Advanced RAG (hybrid keyword-semantic search), for QA; and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's LLaMA 3 70B (open-weight) and OpenAI's GPT-4o mini (API-based). QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT-4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT-4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.
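
The abstract doesn't detail how Advanced RAG fuses its keyword and semantic signals, so below is a minimal, standard-library-only sketch of hybrid retrieval; the term-overlap and bag-of-words cosine scores are stand-ins for a real BM25 index and embedding model, and the fusion weight `alpha` is an assumption, not the paper's configuration.

```python
import math
from collections import Counter

def keyword_score(query: str, doc: str) -> float:
    """Term-overlap count, standing in for a BM25/keyword index."""
    q, d = set(query.lower().split()), Counter(doc.lower().split())
    return float(sum(d[t] for t in q))

def semantic_score(query: str, doc: str) -> float:
    """Bag-of-words cosine, standing in for embedding similarity."""
    a, b = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query: str, corpus: list[str], alpha: float = 0.5, k: int = 3):
    """Normalize each signal per query, then fuse; alpha weights the semantic side."""
    def norm(scores):
        hi = max(scores) or 1.0
        return [s / hi for s in scores]
    kw = norm([keyword_score(query, d) for d in corpus])
    sem = norm([semantic_score(query, d) for d in corpus])
    fused = [(1 - alpha) * kw_s + alpha * sem_s for kw_s, sem_s in zip(kw, sem)]
    ranked = sorted(zip(fused, corpus), reverse=True)
    return [doc for _, doc in ranked[:k]]

corpus = [
    "Graph RAG builds a knowledge graph over the corpus before retrieval.",
    "Hybrid search combines keyword matching with embedding similarity.",
    "Slide decks and podcast scripts are non-traditional academic outputs.",
]
print(hybrid_retrieve("hybrid keyword and semantic retrieval", corpus, k=2))
```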


SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Jin, Yiqiao, Kaur, Rachneet, Zeng, Zhen, Ganesh, Sumitra, Kumar, Srijan

arXiv.org Artificial Intelligence

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvements over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
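
As a rough sketch of the three-level decomposition described above, the following hypothetical structures index a deck at global, page, and element levels and descend only where a query is relevant; the `llm.relevant`/`llm.generate` interface is an assumed placeholder, not SlideAgent's actual agent API.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    kind: str          # e.g. "chart", "icon", "text block"
    description: str   # what an element-level agent extracted

@dataclass
class Page:
    number: int
    summary: str                                     # page-level agent output
    elements: list[Element] = field(default_factory=list)

@dataclass
class DeckIndex:
    global_summary: str    # global agent output: themes across the whole deck
    pages: list[Page]

def answer(index: DeckIndex, query: str, llm) -> str:
    """Selectively descend the hierarchy: global first, pages and elements
    only where the (assumed) llm.relevant check says the query needs them."""
    context = [f"Deck summary: {index.global_summary}"]
    for page in index.pages:
        if llm.relevant(query, page.summary):          # activate page agent
            context.append(f"Page {page.number}: {page.summary}")
            for el in page.elements:                   # activate element agents
                if llm.relevant(query, el.description):
                    context.append(f"  {el.kind}: {el.description}")
    return llm.generate(query, "\n".join(context))
```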


Culturally-Aware Conversations: A Framework & Benchmark for LLMs

Havaldar, Shreya, Rai, Sunny, Cho, Young-Min, Ungar, Lyle

arXiv.org Artificial Intelligence

Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style - a key element of cultural communication - is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today's top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.
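
To make the framework concrete, here is one hypothetical shape a benchmark item and its scoring loop might take, with the situational, relational, and cultural context as separate fields and one score per desideratum; all names and interfaces are guesses rather than the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class ConversationItem:
    culture: str                  # cultural context, e.g. "Japanese workplace"
    relationship: str             # relational context, e.g. "junior to senior"
    situation: str                # situational context, e.g. "declining a request"
    dialogue: list[str]           # the conversation so far
    reference_styles: list[str]   # styles accepted by culturally diverse raters

# The paper's three desiderata for cross-cultural evaluation.
DESIDERATA = ("conversational framing", "stylistic sensitivity", "subjective correctness")

def evaluate(model, judge, item: ConversationItem) -> dict[str, bool]:
    """Score one model reply against each desideratum. `model.respond` and
    `judge` are assumed interfaces (the LLM under test and a rater or
    LLM-judge rubric); neither is specified in the abstract."""
    prompt = (
        f"Culture: {item.culture}\nRelationship: {item.relationship}\n"
        f"Situation: {item.situation}\n" + "\n".join(item.dialogue)
    )
    reply = model.respond(prompt)
    return {d: judge(reply, item.reference_styles, d) for d in DESIDERATA}
```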


Auto-Slides: An Interactive Multi-Agent System for Creating and Customizing Research Presentations

Yang, Yuheng, Jiang, Wenjia, Wang, Yang, Wang, Yiwei, Zhang, Chi

arXiv.org Artificial Intelligence

The rapid progress of large language models (LLMs) has opened new opportunities for education. While learners can interact with academic papers through LLM-powered dialogue, limitations remain: the absence of structured organization and heavy reliance on text can impede systematic understanding of and engagement with complex concepts. To address these challenges, we propose Auto-Slides, an LLM-driven system that converts research papers into pedagogically structured, multimodal slides (e.g., diagrams and tables). Drawing on cognitive science, it creates a presentation-oriented narrative and allows iterative refinement via an interactive editor to match learners' knowledge level and goals. Auto-Slides further incorporates verification and knowledge-retrieval mechanisms to ensure accuracy and contextual completeness. Extensive user studies show that Auto-Slides enhances learners' comprehension and engagement compared to conventional LLM-based reading. Our contributions lie in designing a multi-agent framework that transforms academic papers into pedagogically optimized slides and in introducing interactive customization for personalized learning.
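
A hedged sketch of how such a multi-agent flow might be wired up, with the narrative-planning, verification, and retrieval steps from the abstract as separate calls; the `llm.ask` and `retriever.lookup` interfaces are illustrative assumptions, not Auto-Slides' implementation.

```python
def auto_slides(paper_text: str, llm, retriever) -> list[str]:
    """Plan a narrative, draft slides, then verify each draft against the paper."""
    outline = llm.ask(f"Plan a presentation-oriented narrative for:\n{paper_text}")
    slides = []
    for section in outline.splitlines():
        draft = llm.ask(f"Write one multimodal slide (text plus figure/table "
                        f"references) for this section: {section}")
        # Verification agent: flag claims the source paper does not support,
        # then repair them using retrieved context (knowledge retrieval).
        issues = llm.ask(f"List claims in this slide unsupported by the paper:\n"
                         f"{draft}\n---\n{paper_text}")
        if issues.strip():
            context = retriever.lookup(issues)
            draft = llm.ask(f"Revise the slide to fix:\n{issues}\nUsing:\n{context}")
        slides.append(draft)
    return slides

def refine(slides: list[str], llm, index: int, request: str) -> list[str]:
    """Interactive editor step: apply one learner customization to one slide."""
    slides[index] = llm.ask(f"Rewrite this slide so that: {request}\n{slides[index]}")
    return slides
```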


Intent Tagging: Exploring Micro-Prompting Interactions for Supporting Granular Human-GenAI Co-Creation Workflows

Gmeiner, Frederic, Marquardt, Nicolai, Bentley, Michael, Romat, Hugo, Pahud, Michel, Brown, David, Roseway, Asta, Martelaro, Nikolas, Holstein, Kenneth, Hinckley, Ken, Riche, Nathalie

arXiv.org Artificial Intelligence

Despite Generative AI (GenAI) systems' potential for enhancing content creation, users often struggle to effectively integrate GenAI into their creative workflows. Core challenges include misalignment of AI-generated content with user intentions (intent elicitation and alignment), user uncertainty around how to best communicate their intents to the AI system (prompt formulation), and insufficient flexibility of AI systems to support diverse creative workflows (workflow flexibility). Motivated by these challenges, we created IntentTagger: a system for slide creation based on the notion of Intent Tags - small, atomic conceptual units that encapsulate user intent - for exploring granular and non-linear micro-prompting interactions for Human-GenAI co-creation workflows. Our user study with 12 participants provides insights into the value of flexibly expressing intent across varying levels of ambiguity, meta-intent elicitation, and the benefits and challenges of intent tag-driven workflows. We conclude by discussing the broader implications of our findings and design considerations for GenAI-supported content creation workflows.
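
The core data structure is easy to picture: a tag pairs a target on the slide with one atomic intent, and a set of tags compiles into a single generation request. The sketch below is an illustration of that idea, not IntentTagger's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentTag:
    target: str    # which part of the slide the intent applies to
    intent: str    # the atomic intent itself, possibly ambiguous ("more playful")

def compile_prompt(slide_text: str, tags: list[IntentTag]) -> str:
    """Compose granular tags into one generation request, keeping each intent
    a separate instruction so users can add or remove tags non-linearly."""
    lines = [f"- On '{t.target}': {t.intent}" for t in tags]
    return f"Revise this slide:\n{slide_text}\n\nApply these intents:\n" + "\n".join(lines)

tags = [
    IntentTag(target="title", intent="make it a question"),
    IntentTag(target="chart caption", intent="emphasize the 2024 trend"),
]
print(compile_prompt("Q3 results overview ...", tags))
```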


ChatBCG: Can AI Read Your Slide Deck?

Singh, Nikita, Balian, Rob, Martinelli, Lukas

arXiv.org Artificial Intelligence

With the advanced vision capabilities of GPT-4o and Gemini Flash, an important question arises regarding the accuracy of these functionalities in practical business applications. Our assumption was that multimodal models are good at reading and summarizing charts: when given an image of a slide deck, they do a good job of summarizing key insights from it, often including relevant data points. Existing research into this question has evaluated the efficacy of LLMs when parsing tables [3], concluding that performance is highly sensitive to the input prompts. Other work evaluates LLMs' ability to read and reason over mathematical graphs [2], finding that GPT models outperform alternatives. This paper aims to explore whether multimodal models perform well on a variant of this skill: answering straightforward questions that require the models to pick out a number from a slide deck.
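
A minimal harness for this kind of test might look like the following, using the OpenAI chat completions API to ask a vision model to read one number off a slide image; the prompt wording, exact-match scoring, and model choice are assumptions, not the paper's protocol.

```python
import base64
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY to be set

client = OpenAI()

def ask_number(image_path: str, question: str) -> str:
    """Send one slide image plus a question; ask for the bare number back."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question + " Answer with the number only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

def accuracy(examples) -> float:
    """Exact-match score over (image_path, question, gold_answer) triples."""
    hits = sum(ask_number(img, q) == gold for img, q, gold in examples)
    return hits / len(examples)
```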


Google's new AI video generator is more HR than Hollywood

Engadget

For most of us, creating documents, spreadsheets and slide decks is an inescapable part of work life in 2024. What's not is creating videos. That's something Google would like to change. On Tuesday, the company announced Google Vids, a video creation app for work that the company says can make everyone a "great storyteller" using the power of AI. Vids uses Gemini, Google's latest AI model, to quickly create videos for the workplace.


Here's How AI Will Come for Your Job

The Atlantic - Technology

Abandon all hope, ye who merge spreadsheet cells! Last week, at its annual I/O conference, Google spent hours detailing how large language models would help the knowledge workers of the world unload their busywork onto a legion of eager, capable neural networks. The company will soon introduce AI functions into programs such as Gmail, Google Sheets, and Google Slides that will allow users to type simple commands and receive complex outputs: entire email compositions, for example, or auto-generated tables. The future that Google is promising feels familiar--it's all about heightened convenience and one-click efficiency--and I hate it. Workplace AI feels like the purest distillation of a corrosive ideology that demands frictionless productivity from workers: The easier our labor becomes, the more of it we can do, and the more of it we'll be expected to do.


SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images

Tanaka, Ryota, Nishida, Kyosuke, Nishida, Kosuke, Hasegawa, Taku, Saito, Itsumi, Saito, Kuniko

arXiv.org Artificial Intelligence

Visual question answering on document images that contain textual, visual, and layout information, called document VQA, has received much attention recently. Although many datasets have been proposed for developing document VQA systems, most existing datasets focus on understanding content relationships within a single image rather than across multiple images. In this study, we propose a new multi-image document VQA dataset, SlideVQA, containing 2.6k+ slide decks composed of 52k+ slide images, with 14.5k questions about the decks. SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning, and also provides annotated arithmetic expressions for numerical answers to strengthen numerical reasoning. Moreover, we developed a new end-to-end document VQA model that treats evidence selection and question answering in a unified sequence-to-sequence format. Experiments on SlideVQA show that our model outperforms existing state-of-the-art QA models but still falls well short of human performance. We believe that our dataset will facilitate research on document VQA.
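
The unified sequence-to-sequence framing can be illustrated with the input/output shapes below; the page markers and the "evidence: ... answer: ..." target string are illustrative guesses at such a format, not the exact tokens SlideVQA's model uses.

```python
def build_input(question: str, deck: list[str]) -> str:
    """Linearize a deck (one OCR'd text string per slide) plus the question."""
    pages = " ".join(f"<page_{i}> {text}" for i, text in enumerate(deck, 1))
    return f"question: {question} context: {pages}"

def parse_output(seq: str) -> tuple[list[int], str]:
    """Decode a combined target like 'evidence: 3 5 answer: 42' into the
    selected evidence pages and the final answer."""
    evidence_part, answer = seq.split("answer:")
    pages = [int(tok) for tok in evidence_part.replace("evidence:", "").split()]
    return pages, answer.strip()

print(parse_output("evidence: 3 5 answer: 42"))  # ([3, 5], '42')
```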


New DALL-E integration adds generative AI for next-level slides

#artificialintelligence

For Tome, which calls itself the "new storytelling format for work and important ideas," integrating OpenAI's DALL-E into its flexible, interactive slide options -- which it announced today -- was a natural fit to add a generative AI dimension to decks. When OpenAI announced the release of the DALL-E API in early November, the San Francisco-based startup had its chance. "Making that a part of the storytelling creation experience just felt really natural," Tome CEO Keith Peiris told VentureBeat. "It felt so much more powerful than looking for a stock photo or clip art -- it's kind of giving us a first look at what generative storytelling can look like."